#devtools::install_github("timelyportfolio/parcoords")
library(plotly)
library(tidyverse)
library(RODBC)
library(RODBCext)
library(gsubfn)
library(readr)
library(infuser)
library(viridis)
library(gridExtra)
library(oce)
library(RColorBrewer)
library(lubridate)
library(gganimate)
library(animation)
#magickPath <- shortPathName("C:\\Program Files\\ImageMagick-7.0.5-Q16\\magick.exe")
#ani.options(convert=magickPath)
library(knitr)
library(scales)
library(parcoords)
set.seed(1000)

1. Introduction

I chose to analyze email transactions for Company X. Email is the primary means of reaching the company’s members to:
• promote new and popular content
• create awareness of new products and features
• drive traffic to sponsored programs
Per CAN-SPAM law, Company X must provide a means for members to opt out of email communication.
Emails for Company X are grouped into 6 distinct business areas. Some of the business areas require “opt in” to receive email – the “subscription newsletters” (Red, Orange, Pink). For other business areas, company membership indicates “permission” to email (Green).

Members self-identify in one of 29 different clinical specialties.

To protect the confidentiality of the data, the clinical specialties have been replaced with animal names and the business areas, with color names.

Big questions:

1- Is there a difference in email performance (open, click and opt out rates) between the 6 different business areas?
2- Is there a difference in email performance (open, click and opt out rates) based on member’s clinical specialty?
3- Is there a correlation between open rates, click rates and opt out rates?

Glossary of Terms:
Send Job: A batch of emails carrying the same message to a number of users, identified by message_id.
Message: An email that is sent to a user. There are 3 actions associated with a message: it can be delivered to a user, it can be opened by a user, or a user can click a link in it. A click on the link to opt out of email communication is of special interest, so I define a 4th action on a message and call it opt out. For a send job I can determine how many messages were delivered, how many were opened, how many were clicked and how many users clicked the opt out link.
Open Rate (opened): Percentage of delivered messages that were opened by users (opened*100/delivered)
Click Rate (clicked): Percentage of delivered messages that registered at least one click (clicked*100/delivered)
Opt Out Rate (opt_out): Percentage of delivered messages that registered a click on the opt out link (opt_out*100/delivered)
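
The three rate definitions above can be sketched as a small R helper. The function name email_rates and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the Company X data.

```r
# Hypothetical helper: converts raw send-job counts into percentage rates.
email_rates <- function(delivered, opened, clicked, opt_out) {
  stopifnot(delivered > 0)  # rates are undefined without deliveries
  c(
    open_r    = opened  * 100 / delivered,
    click_r   = clicked * 100 / delivered,
    opt_out_r = opt_out * 100 / delivered
  )
}

# Made-up counts for a single send job
rates <- email_rates(delivered = 10000, opened = 2000, clicked = 300, opt_out = 5)
print(rates)
```

Because every rate uses delivered as the denominator, send jobs of very different sizes become comparable on the same scale.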

2. Team

Due to confidentiality concerns, I conducted this analysis by myself.
Stages of the project:
1- Prepare the initial data
(*) For reasons of confidentiality I won’t be showing the SQL code used to obtain the initial data
2- Import data into RStudio
3- Apply different data visualization techniques, as well as reshaping, combining, filtering, modifying, grouping and summarizing the data. The code applied to the initial data will be visible in this report.
4- Iterate step 3 until interesting visualizations or good stories were found.
5- Select the plots that best display the data analyzed in relationship to the big questions
6- Add animations and interactivity to some plots.
7- Write report.

3. Analysis of Data Quality (Preparation of the initial data)

I obtained email transaction data for this analysis from a table containing over 7.5 billion rows recording email transactions since 2013-01-01. I extracted Message ID, User ID, Action ID and Click ID from the email transaction table. The numbers of delivered messages and clicked messages are fairly reliable measures. However, the number of opened messages is not 100% accurate. To determine whether a message was opened, the system relies on the ability to download a tracking image onto the user’s device. If a user’s device blocks image downloads, the tracking image won’t load and the open won’t be registered.

I extracted “descriptive” data from other tables. For user specialty, I connected email transactions to user data through User ID. For message descriptors, I connected email transactions to message data through Message ID. To obtain the URL and determine whether the user clicked a link to opt out, I connected email transactions to click data.
Action ID was converted to an action of type delivered, opened, clicked or opt_out.
The data was filtered to include USA users of known specialty and send jobs were restricted to the past 12 months with at least 1 user opting out.
I aggregated the data to reduce the number of rows. I grouped transactional data by message_id (send job), action (delivered, opened, clicked, opt_out), user specialty and calculated the number of unique users in each group.
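
Since the SQL used for this step is not shown, here is a minimal R sketch of the same kind of aggregation, assuming a transaction log with one row per user action. The table transactions and all of its values are hypothetical.

```r
library(dplyr)

# Toy transaction log (hypothetical values): one row per user action on a message.
transactions <- tibble(
  message_id = c(1, 1, 1, 1, 1, 2, 2),
  user_id    = c(10, 10, 11, 10, 10, 12, 12),
  specialty  = c("Camel", "Camel", "Camel", "Camel", "Camel", "Snake", "Snake"),
  action     = c("delivered", "opened", "delivered", "opened", "clicked",
                 "delivered", "opened")
)

# Group by send job, action and specialty, then count the distinct users
# in each group. A user who opens the same message twice is counted once.
mail_summary <- transactions %>%
  group_by(message_id, action, specialty) %>%
  summarise(total_messages = n_distinct(user_id), .groups = "drop")

print(mail_summary)
```

Counting distinct users rather than rows is what keeps repeated opens or clicks by the same user from inflating the totals.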

After this preliminary process the data is ready to import into RStudio.
In RStudio I added new variables:
Open Rate: open_r=opened*100/delivered
Click Rate: click_r=clicked*100/delivered
Opt Out Rate: opt_out_r=opt_out*100/delivered
In R I created a data frame “mail_summary_with_details” with the following characteristics:

#save(mail_summary_with_details,file="mail_summary_with_replacement_2.Rda")
load("mail_summary_with_replacement_2.Rda")

mail_summary_with_details<-mail_summary_with_details %>% mutate(specialty=reorder(specialty,total_messages,sum), action=factor(action,levels=c("delivered", "opened", "clicked", "opt_out"))) %>% filter(!(is.na(specialty)))
column_names<-mail_summary_with_details %>% names() %>% paste(collapse=", ")
total_unq_messages<-mail_summary_with_details %>% distinct(message_id) %>% count()
total_rows<-mail_summary_with_details %>% count()

The total number of rows is: 753330
The total number of unique messages is: 8234
The column names are: message_id, message_description, send_date, business_area, newsletter_specialty, specialty, action, device, total_messages

Here is a 5 row sample of the table:

mail_summary_with_details %>% head(5) %>% kable(format = "html", table.attr = "id=\"mytable\"")
| message_id | message_description | send_date | business_area | newsletter_specialty | specialty | action | device | total_messages |
|---|---|---|---|---|---|---|---|---|
| 685900168 | 01_23_2017_mp_f_tuesday_card_usmd_HTML | 2017-01-24 14:06:00 | Orange | Cardiology | Camel | opened | Smart Phone | 3 |
| 232650123 | Critical Images US MD CSPEC DEVELOPMENT | 2016-06-17 18:06:00 | Pink | NONE | Snake | delivered | NA | 14766 |
| 284900278 | CRPS_ 0224_ISU_INT_130_Lifestyle_D18_USMD_M_20161028 | 2016-10-28 16:27:00 | None | NONE | Rabbit | opened | Desktop | 81 |
| 242650234 | 07_27_2016_bn_news_alert_activity2_default_HTML | 2016-07-27 20:26:00 | Orange | NONE | Butterfly | opened | Smart Phone | 206 |
| 242900015 | 20160707_ReferAColleague_USMD | 2016-07-28 07:07:00 | Blue | NONE | Chicken | opt_out | Smart Phone | 2 |

4. Executive Summary

Analysis of message performance by Business Area

As the graphs below show, user engagement with a message varies depending on the business area associated with the message. On the left side you can see bar charts showing the values of the variables of interest: delivered, opened, clicked and opt_out by business area. Because the number of messages sent to users is not evenly distributed across business areas, it is hard to compare how well they perform across different groups, so I used rates as a better basis for comparison. On the right side you can see bar charts showing the rates of opened, clicked and opt out messages by business area. I chose to free the y axis to be able to see details in the plots, and I ordered business areas from the one sending the most messages to the one sending the least.

General description:
About 400 million messages were delivered with very different volumes per business areas
Open rate per business area is between 12% and 22%, Click rate is very variable from 0.4% to 4%, Opt out rate is also variable in a range between 0.03% and 0.08%.
Analysis
50% of the volume of delivered messages is generated by Orange.
Orange’s open rate is near 20%, its click rate is the highest, near 4%, and its opt out rate is the lowest. Orange has a subscription based email policy. It can be concluded that the messages associated with Orange are the best performers across business areas.

In contrast, Red is second in volume of delivered messages, nearing 20%, but its click rate is the second lowest and its opt out rate is twice as large as that of Orange. Red combines subscription based and unsolicited messages, which might explain the difference in performance.

The Green business area is low in volume of delivered messages, but has the highest opt out rate. Green does not have a subscription based or “opt in” email policy.

# Calculate Open Rate, Click Rate and Opt Out Rate by business area
 
bussiness_area_message_rate <-mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum)) %>%  select(business_area,action,total_messages)  %>% group_by(action,business_area) %>% dplyr::summarise(messages= sum(total_messages))  %>% spread(action,messages,fill = 0)  %>% mutate(open_r=opened*100/delivered, click_r=clicked*100/delivered, opt_out_r=opt_out*100/delivered) %>% mutate(delivered=delivered/1000000, opt_out=opt_out/1000 , opened=opened/1000, clicked=clicked/1000) %>%  select(business_area, delivered,open_r ,click_r , opt_out_r, opened, clicked, opt_out) %>% gather(key=action, value=messages, -business_area) %>% mutate(action = factor(action, levels=c("delivered","opened", "clicked", "opt_out", "open_r" ,"click_r","opt_out_r")))
facet_label<-c("delivered"="Sends in Millions",opened="Opens in Thousands", clicked="Clicks in Thousands", "open_r"="Open Rate", "opt_out"="Opt outs in Thousands", click_r="Click Rate", "opt_out_r"="Opt Out Rate")

bussiness_area_message_rate%>% ggplot() + geom_bar(aes(x=business_area ,y=messages, fill=business_area), stat="identity", position = "dodge") +facet_wrap(~action , ncol=2, scales="free", dir = "v", labeller = as_labeller(facet_label)) + coord_flip() + guides(fill = guide_legend(nrow = 1,reverse=TRUE)) +  theme_minimal() +theme(axis.text.x = element_text(size =6), axis.text.y = element_blank(), panel.grid.minor.y=element_blank(), panel.grid.major.y=element_blank(),legend.position = "bottom", legend.title = element_blank(), legend.direction = "horizontal", legend.text = element_text(size = 9), legend.key.size = unit(8, "point"),  strip.text=element_text(size=12, hjust = 0))+ scale_fill_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + labs(title="Metrics by Business Area", x="", y="") 

Analysis of message performance by User Specialty

The plots below show that users also react to messages differently based on their specialty. For this set of plots I chose to focus the reader’s attention on opt out rate, the far right graph. The number of total opt outs is written on the bar, this way readers have a tool to disregard high opt out rates when the real number is insignificant.

You can observe that users of type Monkey get the most emails by far and have one of the lowest opt out rates of all specialties. Though it is harder to notice, the two middle bar charts also show that the open rate and click rate of Monkey users are on the low side. This suggests that Monkey users ignore messages at one of the highest rates, yet choose not to opt out as often. It will be of great value to understand why Monkey users behave this way. They represent, after all, the biggest group and the most important audience for the company. Understanding the behavior of users in the Monkey group will help improve the company’s performance.

Users of type Panda have the third highest opt out rate but the lowest number of delivered messages, so low that you can hardly see it in the “Sends in Millions” graph (far left); therefore we can ignore it. The behavior of users of specialty Chicken is more concerning, though. Their opt out rate is the second highest, with about 5,400 users opting out in 12 months.

facet_label<-c("delivered"="Sends in Millions",opened="Opens in Thousands", clicked="Clicks in Thousands", "open_r"="Open Rate", "opt_out"="Opt outs in Thousands", click_r="Click Rate", "opt_out_r"="Opt Out Rate")
message_specialty_action<-mail_summary_with_details %>% group_by(specialty,action) %>%dplyr::summarise(total_messages=sum(total_messages))%>% ungroup() %>%  spread(action,total_messages,fill = 0) %>% filter(delivered!=0) %>%  mutate(open_r=opened*100/delivered, click_r=clicked*100/delivered, opt_out_r=opt_out*100/delivered) %>% mutate(specialty=reorder(specialty,opt_out_r))%>% mutate(delivered=delivered/1000000, label=opt_out) %>%  select(specialty, delivered,open_r ,click_r , opt_out_r, label) %>% gather(key=action, value=total_messages, -specialty, -label) %>% mutate(label=if_else(action == "opt_out_r", prettyNum(label, big.mark=","), ""), action = factor(action,levels=c("delivered","open_r","click_r","opt_out_r")))

message_specialty_action %>% ggplot(aes(x=specialty, y=total_messages, fill=specialty)) + geom_bar(stat="identity") + geom_text(aes(label = label), hjust="right") + facet_wrap(~action, nrow = 1, scales="free_x", labeller = as_labeller(facet_label)) +coord_flip() +  theme_minimal()  +theme(axis.text.x = element_text(size =6), panel.grid.minor.y=element_blank(), panel.grid.major.y=element_blank(),legend.position = "bottom", legend.title = element_blank(), legend.direction = "horizontal", legend.text = element_text(size = 9), legend.key.size = unit(8, "point"),  strip.text=element_text(size=12, hjust = 0))+ scale_fill_manual(values=colorRampPalette(brewer.pal(6, "Accent"))(29))+ guides(fill = guide_legend(nrow =2,reverse=TRUE))+ labs(title="Metrics by Specialty, with opt out value written on the bar in the far right graph", x="", y="") 

5. Main Analysis

Scatter plots

The first scatter plot shows the relationship between the number of delivered messages and the number of opened messages. The second scatter plot shows the relationship between the number of delivered messages and the number of opt outs. Each circle in the plot represents a send job. I also added a smooth line to help determine overall trends.
In both cases I see a positive, somewhat strong linear relationship between the two variables (Sends vs. Opens and Sends vs. Opt outs). I can also corroborate some of the findings described in the Executive Summary chapter: I see many Red and Green send jobs above the smooth line for Sends vs. Opt outs. Red tends to have relatively small batches of sent messages (fewer than 170,000 users per batch), but I see two outliers with more than 500,000 users per batch that caused about 600 users each to click an opt out link, way above the average opt out rate.
The plots are very dense in the lower third of the graph, and it is hard to see what is going on when send jobs are small to medium in size.
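
One way to put a number on the visual impression of a linear relationship is a Pearson correlation on the per-send-job counts. This is only a sketch: the data frame send_jobs and its values are made up for illustration and are not the Company X data.

```r
# Made-up per-send-job counts (hypothetical, for illustration only)
send_jobs <- data.frame(
  delivered = c(500, 2000, 10000, 50000, 170000, 500000),
  opened    = c(90, 410, 1900, 10500, 33000, 98000),
  opt_out   = c(1, 2, 6, 20, 70, 600)
)

# Pearson correlation of deliveries with opens and with opt outs
cor_opens   <- cor(send_jobs$delivered, send_jobs$opened)
cor_optouts <- cor(send_jobs$delivered, send_jobs$opt_out)

print(c(opens = cor_opens, opt_outs = cor_optouts))
```

A caveat with this kind of measure is that a few very large send jobs can dominate the coefficient, so it is best read alongside the scatter plot rather than instead of it.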

(*)Removed outlier message_id=210650080 for better visibility

message_action <- mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum))%>% filter(message_id!=210650080)%>% filter(action!="clicked")%>% group_by(message_id, action, business_area) %>%  dplyr::summarise(total_messages= sum(total_messages))  %>% spread(action, total_messages, fill = 0) %>%  gather(key=action, value=total_messages, -message_id, -delivered, -business_area)  %>%  mutate(action = factor(action, levels=c("opened" ,"clicked" , "opt_out")))

facet_label<-c(delivered="Delivers (in 10,000 messages)",opened="Opens", "opt_out"="Opt outs")
message_action %>% mutate(delivered=delivered/10000)%>% ggplot(aes(delivered,total_messages)) + geom_point(aes(col = business_area), shape=21, alpha=0.8) + geom_smooth(color = "black", size = 0.5, se=FALSE) + facet_wrap(~action,ncol = 1,  scales='free_y', labeller = as_labeller(facet_label),switch='y') + scale_color_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_blank(), legend.direction = "horizontal", legend.text = element_text(margin = margin(2, unit = "pt")),strip.placement = "outside", axis.title.y=element_blank(), axis.title.x=element_text(size=10)) + guides(colour = guide_legend(reverse = TRUE,nrow = 1, override.aes = list(shape = 19, alpha=1))) + labs(title="Sends (in 10,000 messages) vs. Opens and Opt outs",x = "Sends (in 10,000 messages)")

Parallel coordinates chart

I added an interactive parallel coordinates chart to be able to analyze different variables at the same time. I changed the coloring variable, reordered columns, sliced the data, worked with a sample set and the full set, and filtered the data to remove outliers, all to try to find relationships or patterns in the data. For this report I decided to use quantile=ntile(opens_r, 4) as the coloring variable. The ntile function allowed me to split the data into 4 groups of equal size based on the value of the opens_r (open rate) variable. I also took a sample of 500 rows to improve performance.
It was hard to find dependencies and correlations in the data. However, one thing stood out: the open rate for the second group showed a narrow range around the 20% mark.
Sampling the data had a negative effect. The first group is mostly imperceptible since most of the sampled data in that group has zero open rate.
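
To make the coloring variable concrete, here is a small sketch of how ntile assigns rows to buckets. The open rates in toy are hypothetical.

```r
library(dplyr)

# Hypothetical open rates for eight send jobs
toy <- tibble(opens_r = c(0, 0, 5, 12, 19, 20, 21, 40))

# ntile() ranks the rows and cuts them into 4 buckets of equal size;
# tied values are split by position, so each bucket gets 2 rows here.
toy <- toy %>% mutate(quantile = ntile(opens_r, 4))

# Range of open rates covered by each bucket
bucket_ranges <- toy %>%
  group_by(quantile) %>%
  summarise(min_r = min(opens_r), max_r = max(opens_r), n = n())

print(bucket_ranges)
```

Because the buckets hold equal numbers of rows rather than equal ranges of values, a bucket can be very narrow where many send jobs share similar open rates, and the lowest bucket can consist mostly of zeros.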
Here is the plot. Feel free to explore on your own.

message_specialty_rates_sample<-mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum),specialty=reorder(specialty,total_messages, sum))%>% group_by(message_id, business_area,action, specialty) %>% dplyr::summarise(messages=sum(total_messages)) %>% spread(action,messages, fill=0) %>% ungroup()  %>% mutate(opens_r=opened*100/delivered, opt_out_r=opt_out*10000/delivered, click_r=clicked*10000/delivered) %>% ungroup() %>% select(-message_id) %>% filter(click_r<5000, opens_r<100, opt_out_r<200) %>% sample_n(500)%>% mutate(quantile=ntile(opens_r, 4))

  parcoords(message_specialty_rates_sample,
    rownames = F # turn off rownames from the data.frame
    , brushMode = "1D-axes"
    , reorderable = T
    , queue = T
    , color = list(
      colorBy = "quantile"
      ,colorScale = htmlwidgets::JS("d3.scale.category10()")
    )    
  )

Violin plots with jitters

I was curious to see the distribution of some variables by business area. I thought violin plots would be a good visualization tool for this purpose.

Each circle in the jitters represents a send job. We can estimate the number of send jobs by the density of the jitters.

The first set of violin plots shows open rates by business area. You can see business areas with low open rate variance (Example: Orange and Red) and high open rate variance (Pink and Blue). The plots also show some bi-modal distribution which will be interesting to explore further.

The second set of violin plots shows opt out rate by business area. These plots show more variation in the mean of each distribution, indicated by the red dot. Some business areas have a very low mean opt out rate, such as Red; some have a very high one, such as Blue, whose mean falls outside the plot area. In both plots I used the coord_cartesian function in ggplot to zoom in on the bulk of the data and keep outliers out of view.
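
It may help to see why a mean can land outside the plot area. coord_cartesian only zooms the view, so summary statistics are still computed from all the data, outliers included. Setting limits on the scale instead, which this report does not do, drops out-of-range values before the statistics are computed. The sketch below uses a hypothetical data frame df to contrast the two.

```r
library(ggplot2)

# Hypothetical rates for one business area, with a single extreme outlier
df <- data.frame(area = "A", rate = c(1, 2, 3, 100))

base_plot <- ggplot(df, aes(area, rate)) +
  stat_summary(fun = mean, geom = "point", color = "red")

# Zoom only: the mean is still computed from all four values
zoomed <- base_plot + coord_cartesian(ylim = c(0, 10))

# Scale limits: the outlier is discarded before the mean is computed
clipped <- base_plot + scale_y_continuous(limits = c(0, 10))

# layer_data() returns the values that the summary layer would draw
mean_zoomed  <- layer_data(zoomed)$y
mean_clipped <- suppressWarnings(layer_data(clipped)$y)

print(c(zoomed = mean_zoomed, clipped = mean_clipped))
```

With coord_cartesian the outlier still pulls the mean upward even though the point itself is hidden, which matches a mean dot falling above the visible range.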

message_summary_rates<-mail_summary_with_details %>% mutate(business_area=reorder(business_area,total_messages, sum))%>% group_by(message_id, business_area,action) %>% dplyr::summarise(messages=sum(total_messages)) %>% spread(action,messages, fill=0) %>% ungroup() %>% mutate(opens_r=opened*100/delivered, opt_out_r=opt_out*100/delivered, click_r=clicked*100/delivered)


message_summary_rates %>% ggplot(aes(x=factor(business_area, levels = rev(levels(business_area))), y= opens_r))+
geom_point(position="jitter", aes(col=business_area), shape=21,alpha = 0.2,show.legend = FALSE) + geom_violin(trim=FALSE, alpha=0) + stat_summary(fun.data=mean_sdl,
geom="pointrange", color="red") + scale_color_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + labs(title="Violin Plots showing density distribution of Open Rate by Business Area",x="", y = "Open Rate (per 100 Sends)") +
theme_classic() +theme(axis.text.x = element_text(angle=30,hjust=1)) + coord_cartesian(ylim=c(0,50))

message_summary_rates %>% ggplot(aes(x=factor(business_area, levels = rev(levels(business_area))), y= opt_out_r))+
geom_point(position="jitter", aes(col=business_area), shape=21,alpha = 0.2,show.legend = FALSE) +
geom_violin(alpha=0, adjust = 32) + stat_summary(fun.data=mean_sdl,
geom="pointrange", color="red")+ scale_color_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) + labs(title="Violin Plots showing density distribution of Opt Out Rate by Business Area",x="", y = "Opt Out Rate (per 100 Sends)") +
theme_minimal() +theme(axis.text.x = element_text(angle=30,hjust=1)) + coord_cartesian(ylim=c(0,0.8))

Histograms

In an effort to find what causes bimodality in the open rate distribution, I drew the open rate frequency distribution with the help of the histogram function in ggplot. I used animation to explore how different business areas contributed to the overall histogram. The graphs show that the business areas responsible for the higher number of messages have a well defined bell curve for their open rate frequency distribution, with low variance. On the other hand, the business areas that generate the fewest messages are less predictable, with high variance.
It is worth noting that the business areas with a higher mean correspond to business areas with subscription based email policies (Orange, Red, Pink) and the ones with a lower mean correspond to business areas that send messages to any user who has not opted out (Light Blue and Green).

I had difficulties with the library gganimate, the flag cumulative = TRUE was not working. To compensate for that, I chose to add the full distribution in the background with low opacity.

histogram_data<-mail_summary_with_details %>% inner_join((mail_summary_with_details %>% filter(action=='delivered' & total_messages >100) %>% group_by(message_id) %>% dplyr::summarise(delivered=sum(total_messages))), by ="message_id") %>% mutate(business_area=reorder(business_area,total_messages, sum)) %>% filter(action=='opened') %>%group_by(message_id,business_area, delivered) %>% dplyr::summarise(opened=sum(total_messages) )%>% mutate(opens_r=round(opened*100/delivered,2))  

p2<-ggplot(histogram_data,aes(x=opens_r, cumulative=T)) + geom_histogram(aes(fill=business_area),bins=1000,alpha=0.2 )+ geom_histogram(aes(fill=business_area,frame=business_area, cumulative=TRUE),bins=1000 ) + theme_minimal() +  scale_fill_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) +coord_cartesian(xlim = c(0,50))  + guides(fill=FALSE) + labs(title="Histogram showing Open Rate Frequency Distribution\n\n", x="Open Rate", y="Open Rate Frequency")

gganimate(p2)    
ggplot(histogram_data,aes(x=opens_r, cumulative=T)) + geom_histogram(aes(fill=business_area),bins=1000,alpha=0.2 )+ geom_histogram(aes(fill=business_area),bins=1000 ) +facet_wrap(~business_area, nrow=2) + theme_minimal() +  scale_fill_manual(values=colorRampPalette(brewer.pal(10, "Paired"))(10)) +coord_cartesian(xlim = c(0,50))  + guides(fill=FALSE) + labs(title="Histogram showing Open Rate distribution\n\n", x="Open Rate", y="Open Rate Frequency")

Heatmaps

I had explored how different metrics were affected by business area and by specialty. With a heatmap I am hoping to get some insight into how both variables together affect opt out rates. The plots are dense with information, so I used an interactive heatmap to be able to zoom in and see details. The color of the tile indicates the opt out rate value: dark blue represents the lowest rate and yellow the highest. Opt outs weigh more when the number of delivered emails is small, so I chose to enhance the information provided in the tooltip shown when hovering over a tile. This allows the person interacting with the plot to judge whether the opt out rate has merit or can be ignored.

I found that it was more informative to scale the values by columns and by rows separately. It highlights differences in performance much better.

The plots are read differently. In the first case the opt out rate is scaled for each specialty individually, so you focus on one column (specialty) and see how users in this group interact with messages from different business areas. In contrast, in the second plot the rows are scaled individually and the plot is read by row. Columns and rows were sorted by the average opt out rate. This helps the reader see areas with high performers and areas with low performers.
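
The difference between the two scalings can be sketched on a tiny table. The specialties and business areas reuse names from this report, but the opt out rates in toy are hypothetical.

```r
library(dplyr)
library(scales)

# Hypothetical opt out rates for two specialties across two business areas
toy <- tibble(
  specialty     = c("Chicken", "Chicken", "Monkey", "Monkey"),
  business_area = c("Orange", "Blue", "Orange", "Blue"),
  opt_out_r     = c(0.02, 0.10, 0.01, 0.03)
)

scaled <- toy %>%
  group_by(specialty) %>%
  mutate(by_specialty = rescale(opt_out_r)) %>%      # 0 to 1 within each heatmap column
  group_by(business_area) %>%
  mutate(by_business_area = rescale(opt_out_r)) %>%  # 0 to 1 within each heatmap row
  ungroup()

print(scaled)
```

In this toy table, Monkey's rate for Blue scores 1 when scaled by specialty, because it is Monkey's worst area, but 0 when scaled by business area, because Chicken's rate for Blue is higher. That is why the two heatmaps have to be read along different directions.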

Let me give an example of how to read the first plot using the Chicken specialty. Users with this specialty perform the worst with messages from business area Blue and the best with business area Orange.

It is worth noting that most specialties perform best when interacting with messages from business area Orange and worst with business area Green, a fact that can also be seen in the Executive Summary with a different type of graph.

A similar analysis can be done for Open Rate and Click Rate to evaluate how business area and specialty contribute to these variables. I am not including these extended analyses in this report.

specialty_business_area_opt_out_scaled<-mail_summary_with_details %>% dplyr::filter(!is.na(specialty))%>% group_by(specialty, business_area ,action) %>% dplyr::summarise(messages= sum(total_messages))%>%spread(action,messages,fill = 0)  %>% mutate(opt_out_r=opt_out*100/delivered) %>% mutate(delivered=delivered/1000) %>% ungroup() %>% mutate(specialty=reorder(specialty,desc(opt_out_r),mean),business_area=reorder(business_area,desc(opt_out_r),mean)) %>% mutate(Sends = paste(delivered*1000, "\nOpt outs:", opt_out, "\nOpt out rate:", round(opt_out_r,2))) %>% arrange(specialty,  opt_out_r) %>% group_by(specialty) %>%  mutate(by_specialty=rescale(opt_out_r)) %>%ungroup() %>%  group_by(business_area) %>%  mutate(by_business_area=rescale(opt_out_r))

q<-specialty_business_area_opt_out_scaled %>%  ggplot(aes(y=business_area,x= specialty, fill = by_specialty, label=Sends)) + geom_tile(show.legend = TRUE) + theme_minimal() + theme(axis.text.x = element_text(size = 6, angle = 30, hjust=1), axis.text.y = element_text(size = 6,angle = 30, hjust=1), axis.title=element_blank(), plot.margin = unit(c(t = 0, r = 50, b = 80, l = 5), "pt")) + scale_fill_viridis() +labs(fill="", title= "Heatmap with opt out rate scaled by column (by specialty)") 
ggplotly(q, tooltip = c("y", "x", "label"))
q<-specialty_business_area_opt_out_scaled %>%  ggplot(aes(y=business_area,x= specialty, fill = by_business_area, label=Sends)) + geom_tile(show.legend = TRUE) + theme_minimal() + theme(axis.text.x = element_text(size = 6, angle = 30, hjust=1), axis.text.y = element_text(size = 6,angle = 30, hjust=1), axis.title=element_blank(), plot.margin = unit(c(t = 0, r = 50, b = 80, l = 5), "pt")) + scale_fill_viridis() +labs(fill="", title= "Heatmap with opt out rate scaled by row (by business area)") 
ggplotly(q, tooltip = c("y", "x", "label"))

6. Conclusion:

With this analysis I was able to establish that:
1- Email performance depends on the sender (business area) and the recipient (user specialty).
2- There is a strong correlation between the size of the email batch and the number of messages opened, clicked and those causing users to opt out.
3- Open rates are quite predictable: higher for business areas with subscription based policies and lower for unsolicited emails.
4- The vast majority of send jobs originate in the Orange business area and overwhelmingly target users of the Monkey specialty.

Recommendations

1- The company should have clear strategies to increase open rates and reduce opt out rates.
2- Since performance depends on sender and recipient, each business area should have their tailored strategy to contact users based on their specialties. The heatmaps are a great tool to help them with the task. For example we had established that users in group Chicken have a very high opt out rate. If you look at the heatmap (by specialty) we see that business areas contribute to the opt out rate differently. The highest contributor for Chicken is the Blue business area but it can be ignored because the opt out value is very low. However the Red business area has a very high opt out rate for Chicken. Therefore the Red business area should have a specific strategy to reduce opt out rate of users with specialty Chicken.
3- Unsolicited emails increase the chances of users opting out of email communication; therefore business areas sending unsolicited emails need to be more strategic and find an optimal balance to achieve their goal without causing the company to lose the ability to contact users via email.

Pain points

I found the data set hard to analyze and I struggled with finding a good story to tell. I looked at the data from many angles, using a variety of data visualization techniques and most of the work I generated didn’t make it to the report because it was not showing any relevant information.
The biggest problems I found were the size of the data set, which is extremely large, and the extreme values in the data. Some business areas had a huge number of send jobs whereas others had few. Some send jobs targeted over 400,000 users and others targeted fewer than 500 users. Users of some specialties were targeted in huge numbers and users in other specialties were rarely targeted.
User behavior upon receiving emails was quite difficult to assess. I was able to find some interesting patterns in open rates, but clicks and opt outs were harder to analyze. Clicks are highly dependent on the number of links in the message and I didn’t have that information to take under consideration. Opt outs tend to be very small and susceptible to the size of the email batch (the number of messages sent in the send job).

Next steps

More granularity in the data might help with drawing conclusions about user behavior and message performance. Information such as attractiveness of the subject line, frequency of messages and purpose of the message can help with understanding message performance and might provide business areas with insights on how to make messages more attractive to users. For understanding user behavior, more data will also be very helpful: new member vs. seasoned member, very active on the website vs. less active, previous email engagement, age, gender. Categorizing users by their specialty only was very restrictive.

Lessons learned

This project was an amazing opportunity to apply exploratory data analysis and data visualization techniques and I hope this report shows a variety of approaches used to tackle a complex data analysis problem.

Time spent

My primary motivation was to find a good story and searching for a good story kept me busy until the very last minute. I didn’t feel satisfied with my findings so I kept on looking.

Regrets

I wish I had a team to share the work load with and to exchange ideas on directions and ways to look at the data. I wish I had brainstorming sessions with people to identify good questions to ask of our data. I would have appreciated some help with writing this report.

Final words

Not finding a good story can be the story in itself. I thoroughly enjoyed working on this project and learned a great deal about how to build good visualizations and how to interpret complex graphs. I became much more fluent with R and ggplot, and I have started to regularly use techniques learned throughout the semester in my current job.

Thank you for a great class which encouraged me to work really hard.